Sparse Learning Based Linear Coherent Bi-clustering
Abstract
Clustering algorithms are often limited by the assumption that each data point belongs to a single class and that all features of a data point are relevant to determining that class. Such assumptions are inappropriate in applications such as gene clustering, where, given expression profile data, genes may behave similarly under only some conditions, and a gene may participate in more than one functional process and hence belong to multiple groups. Identifying genes that have similar expression patterns over a common subset of conditions is a central problem in gene expression microarray analysis. To overcome the limitations of standard clustering methods for this purpose, bi-clustering has often been proposed as an alternative: one seeks groups of observations that exhibit similar patterns over a subset of the features. In this paper, we propose a novel sparse learning based model, SLLB, for identifying linear coherent bi-clusters in gene expression data, a structure that strictly generalizes the type of bi-cluster considered by other methods. The model builds on recent sparse learning techniques that have gained significant attention in the machine learning research community. Experiments on both synthetic data and real gene expression data demonstrate that the model is significantly more effective than current bi-clustering algorithms for these problems. We also discuss the parameter selection problem and the model's usefulness in other machine learning clustering applications. The online appendix for this paper can be found at http://www.cs.ualberta.ca/~ys3/SLLB.
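To make the target structure concrete, the following is a minimal sketch of what a linear coherent bi-cluster could look like in an expression matrix. It assumes the common reading of "linear coherent": within the bi-cluster, each gene's expression over the selected conditions is a scaled and shifted copy of a shared profile. That definition, the function names `plant_bicluster` and `min_abs_corr`, and all sizes are illustrative assumptions, not taken from the paper, and the sketch does not implement SLLB.

```python
import numpy as np


def plant_bicluster(n_genes, n_conds, rows, cols, seed=0):
    """Return a random matrix with one planted linear coherent bi-cluster.

    Assumption (not from the paper): a gene r in the bi-cluster satisfies
    X[r, cols] = scale_r * base + shift_r for a shared profile `base`.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_genes, n_conds))   # background noise
    base = rng.normal(size=len(cols))         # shared profile over `cols`
    for r in rows:
        # Non-zero scale with random sign, plus an arbitrary shift.
        scale = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 2.0)
        shift = rng.normal()
        X[r, cols] = scale * base + shift
    return X


def min_abs_corr(X, rows, cols):
    """Smallest absolute Pearson correlation between the given rows,
    computed only over the given columns."""
    sub = X[np.ix_(rows, cols)]
    corr = np.corrcoef(sub)                   # each row is one variable
    return float(np.abs(corr).min())


rows = [2, 5, 7, 11]
cols = [0, 3, 4, 8, 9, 12, 15, 20]
X = plant_bicluster(20, 30, rows, cols)

# Over the selected conditions the genes are perfectly (anti-)correlated;
# over all conditions the relationship is hidden by the other columns.
print(min_abs_corr(X, rows, cols))
print(min_abs_corr(X, rows, list(range(X.shape[1]))))
```

The comparison between the two printed values illustrates why a method restricted to whole rows can miss such groups: the linear relationship holds only on the subset of conditions.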
Similar resources
Linear Coherent Bi-cluster Discovery via Beam Detection and Sample Set Clustering
We propose a new bi-clustering algorithm, LinCoh, for finding linear coherent bi-clusters in gene expression microarray data. Our method exploits a robust technique for identifying conditionally correlated genes, combined with an efficient density based search for clustering sample sets. Experimental results on both synthetic and real datasets demonstrated that LinCoh consistently finds more ac...
Linear Coherent Bi-Clustering via Beam Searching and Sample Set Clustering
We propose a new bi-clustering algorithm, LinCoh, for finding linear coherent bi-clusters in gene expression microarray data. Our method exploits a robust technique for identifying conditionally correlated genes, combined with an efficient density based search for clustering sample sets. Experimental results on both synthetic and real datasets demonstrated that LinCoh consistently finds more ac...
Linear Coherent Bi-cluster Discovery via Line Detection and Sample Majority Voting
Discovering groups of genes that share common expression profiles is an important problem in DNA microarray analysis. Unfortunately, standard bi-clustering algorithms often fail to retrieve common expression groups because (1) genes only exhibit similar behaviors over a subset of conditions, and (2) genes may participate in more than one functional process and therefore belong to multiple group...
Greedy Minimization of Weakly Supermodular Set Functions
This paper defines weak-α-supermodularity for set functions. It shows that minimizing such functions under cardinality constraints is a common task in machine learning and data mining. Moreover, any problem whose objective function exhibits this property benefits from a greedy extension phase. Explicitly, let S∗ be the optimal set of cardinality k that minimizes f and let S0 be an initial soluti...
Sparse and Low-Rank Modeling on High Dimensional Data: A Geometric Perspective
BIAN, XIAO. Sparse and Low-Rank Modeling on High Dimensional Data: A Geometric Perspective. (Under the direction of Dr. Hamid Krim.) High dimensional data exhibits distinct properties compared to its low dimensional counterpart, which causes a common performance decrease and a formidable computational cost increase of traditional approaches. Novel methodologies are therefore needed to character...